@dprevoznik commented Jan 23, 2026

feat(templates): Implement Gemini Computer Use templates for TypeScript and Python

Summary

This PR introduces full-fledged Gemini Computer Use templates for both TypeScript and Python. The templates implement Google's Gemini 2.5 Computer Use model using Kernel's Computer Controls API for browser automation.

Changes

New Templates

  • TypeScript: Complete rewrite of gemini-computer-use template with native Kernel integration
  • Python: New gemini-computer-use template following the same architecture

Architecture

Both templates follow a modular design based on Google's computer-use-preview reference:

| File | Description |
| --- | --- |
| index.ts / main.py | Entry point with Kernel app registration and action handler |
| loop.ts / loop.py | Gemini sampling loop: orchestrates model calls and action execution |
| session.ts / session.py | Browser session management with optional replay recording |
| tools/computer.ts / tools/computer.py | Maps Gemini actions to Kernel's Computer Controls API |
| tools/types/ | Type definitions for Gemini function calls |

Supported Actions

| Action | Description |
| --- | --- |
| click_at | Click at coordinates (x, y) |
| hover_at | Move mouse to coordinates |
| type_text_at | Click and type text at coordinates |
| scroll_document | Scroll page (up/down/left/right) |
| scroll_at | Scroll at specific coordinates |
| navigate | Navigate to a URL |
| go_back / go_forward | Browser history navigation |
| key_combination | Press key combinations (e.g., "ctrl+c") |
| drag_and_drop | Drag from one point to another |
| wait_5_seconds | Wait action |

Key Features

  • Native Kernel Integration: Uses Kernel's Computer Controls API directly instead of third-party browser automation libraries
  • Replay Recording: Optional video replay recording for debugging (paid plans only)
  • Context Management: Prunes older screenshots from the conversation history to stay within context limits
  • Coordinate Normalization: Converts Gemini's normalized coordinates (0-1000) to actual screen pixel coordinates
  • Local Development: Both templates support local execution via npx tsx index.ts / direct Python execution
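
The coordinate conversion mentioned under Key Features can be sketched as follows. This is a minimal illustration, not the template's actual code: the names `denormalize` and `denormalize_point` are hypothetical, and clamping to `size - 1` is an assumption made here so that a value of 1000 stays inside the viewport.

```python
from typing import Tuple

NORMALIZED_MAX = 1000  # Gemini reports coordinates on a 0-1000 grid


def denormalize(value: int, size: int) -> int:
    """Scale one normalized coordinate onto a pixel axis of length `size`."""
    clamped = max(0, min(NORMALIZED_MAX, value))
    # Clamp to size - 1 so a value of 1000 still lands inside the viewport
    # (an assumption of this sketch, not confirmed template behavior).
    return min(size - 1, clamped * size // NORMALIZED_MAX)


def denormalize_point(x: int, y: int, screen: Tuple[int, int]) -> Tuple[int, int]:
    """Convert a normalized (x, y) pair to pixels for a (width, height) screen."""
    width, height = screen
    return denormalize(x, width), denormalize(y, height)


# The normalized center of the grid maps to the center of a 1200x800 viewport.
center = denormalize_point(500, 500, (1200, 800))
```

Integer arithmetic is used so the result is deterministic and never a float.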

CLI Updates

  • Added TemplateGeminiComputerUse constant and template metadata in pkg/create/templates.go
  • Templates prioritized in selection list alongside Anthropic and OpenAI computer use templates

Usage

# Create a new project
kernel create my-app --language typescript --template gemini-computer-use

# Deploy
kernel deploy index.ts --env-file .env

# Invoke
kernel invoke ts-gemini-cua cua-task --payload '{"query": "Navigate to https://example.com and describe the page"}'

# With replay recording
kernel invoke ts-gemini-cua cua-task --payload '{"query": "...", "record_replay": true}'

Requirements

Testing

Templates have been manually tested with various browser automation tasks.

Related

  • Closes KERNEL-870

Note

Adds native Gemini Computer Use support across both languages and aligns docs/tests/CLI metadata.

  • New Python template python/gemini-computer-use with main.py, loop.py, session.py, and tools/ implementing Kernel Computer Controls and Gemini sampling loop
  • TypeScript template rewrite typescript/gemini-computer-use: replaces Stagehand with native Kernel APIs; adds loop.ts, session.ts, tools/; updates index.ts, README, and dependencies
  • CLI templates registry: marks GeminiComputerUse as available for Python and TypeScript; sets concrete invoke commands for both languages
  • QA docs: adds py-gemini-cua to matrix, create/deploy steps, invoke commands; updates totals (18 apps/21 tests)
  • Tests: removes "gemini-computer-use not available for python" cases and related unavailability checks

Written by Cursor Bugbot for commit 5e859dc.

Implement browser control agents using the Gemini 2.5 Computer Use Preview
model (gemini-2.5-computer-use-preview-10-2025) with Kernel's Computer
Controls API.

## TypeScript Template Changes
- Refactored index.ts to use modular architecture
- Added loop.ts: Core sampling loop implementing Google's agent pattern
- Added session.ts: KernelBrowserSession for browser lifecycle management
- Added tools/computer.ts: Maps Gemini actions to Kernel Computer Controls
- Added tools/types/gemini.ts: Type definitions for Gemini actions
- Updated package.json: @google/genai ^1.0.0

## Python Template (New)
- main.py: Action handler with KernelBrowserSession context manager
- loop.py: Async sampling loop matching TypeScript implementation
- session.py: Async context manager for browser lifecycle
- tools/computer.py: Gemini-to-Kernel action mapping
- tools/types.py: Dataclasses and enums for Gemini actions

## Integration Pattern
Both templates implement Google's recommended Computer Use agent loop:
1. Send request with computerUse tool configured
2. Receive function_call from model (click_at, type_text_at, navigate, etc.)
3. Execute via Kernel Computer Controls API
4. Capture screenshot + URL, send as function_response
5. Loop until task complete
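
The five steps above can be sketched as a small loop. This is a simplified, synchronous illustration under stated assumptions: the model call and the Kernel executor are injected as plain callables, and `FunctionCall`, `FunctionResponse`, and `run_agent_loop` are hypothetical names, not the templates' real types. The actual templates are described as async.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FunctionCall:
    name: str
    args: dict


@dataclass
class FunctionResponse:
    name: str
    screenshot_b64: str
    url: str


def run_agent_loop(
    next_call: Callable[[list], Optional[FunctionCall]],
    execute: Callable[[FunctionCall], FunctionResponse],
    query: str,
    max_iterations: int = 50,
) -> list:
    """Alternate model calls and action execution until the model stops acting."""
    history: list = [query]
    for _ in range(max_iterations):
        call = next_call(history)      # steps 1-2: send request, receive function_call
        if call is None:               # step 5: no further action means the task is done
            break
        history.append(call)
        history.append(execute(call))  # steps 3-4: execute, then record screenshot + URL
    return history


# Usage with a scripted stand-in for the model and a fake executor.
scripted = iter([
    FunctionCall("navigate", {"url": "https://example.com"}),
    FunctionCall("click_at", {"x": 500, "y": 500}),
    None,
])
history = run_agent_loop(
    next_call=lambda h: next(scripted),
    execute=lambda c: FunctionResponse(c.name, "<base64>", "https://example.com"),
    query="Describe the page",
)
```

Injecting the two callables keeps the control flow testable without a Gemini API key or a live browser.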

## Key Features
- Coordinate denormalization (Gemini 0-1000 → Kernel pixels)
- Screenshot pruning to manage context size
- Optional replay recording for debugging
- Safety decision handling (auto-acknowledge for automation)
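
Screenshot pruning could look roughly like the sketch below. It assumes each turn is a dict with an optional `screenshot` key and that only the newest few screenshots are kept. The function name, the dict shape, and the default of 3 are all hypothetical choices for illustration.

```python
def prune_screenshots(history: list, keep_last: int = 3) -> list:
    """Return a copy of `history` in which only the newest `keep_last`
    turns still carry a screenshot; other fields are left untouched."""
    remaining = keep_last
    pruned = []
    # Walk from newest to oldest so the most recent screenshots survive.
    for turn in reversed(history):
        if "screenshot" in turn:
            if remaining > 0:
                remaining -= 1
            else:
                turn = {k: v for k, v in turn.items() if k != "screenshot"}
        pruned.append(turn)
    pruned.reverse()
    return pruned


turns = [{"url": f"https://example.com/{i}", "screenshot": f"img{i}"} for i in range(5)]
pruned_turns = prune_screenshots(turns, keep_last=2)
```

The function builds new dicts rather than mutating the input, so the caller's original history is preserved.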

Tested: Both templates successfully navigate to Wikipedia and extract
the featured article title.
Minor cleanup following Option A from the alignment analysis:

## Removed Unused Code
- Python: Removed unused EnvState class from types.py
- Python: Removed unused GeminiAction and Optional imports from loop.py
- TypeScript: Removed unused EnvState interface from gemini.ts

## Fixed Type Hints
- Python: Use GeminiFunctionArgs TypedDict instead of Dict[str, Any]
- Python: Export GeminiFunctionArgs from tools package
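
As a rough illustration of why a TypedDict helps here, the sketch below declares the argument keys once so a type checker can flag typos that `Dict[str, Any]` would let through. The field names are assumptions inferred from the Supported Actions list and are not taken from the template source.

```python
from typing import List, TypedDict


class GeminiFunctionArgs(TypedDict, total=False):
    # All keys are optional because each action uses a different subset.
    # Field names below are assumed for illustration.
    x: int
    y: int
    text: str
    url: str
    direction: str
    keys: str


def describe(name: str, args: GeminiFunctionArgs) -> str:
    """Render a function call as a short log line."""
    parts: List[str] = [f"{key}={value!r}" for key, value in args.items()]
    return f"{name}({', '.join(parts)})"


click_args: GeminiFunctionArgs = {"x": 500, "y": 250}
log_line = describe("click_at", click_args)
```

At runtime a TypedDict is an ordinary dict, so the change affects static checking only.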

## Fixed Naming Consistency
- Python: Renamed SCREENSHOT_DELAY_MS to SCREENSHOT_DELAY_SECS (was already in seconds)

## Improved Error Messages
- Added deployment hint to GOOGLE_API_KEY error
- Added payload format hint to query validation error

Net result: -15 lines of unused code, better type safety, clearer errors.
Add new Gemini Computer Use templates to the qa.md flow
Remove excessive JSDoc/docstring comments on private helpers and simple
type definitions that don't need explanation. Keep only necessary
comments that document non-obvious behavior.
…dling

- Changed default viewport dimensions from 1024x768 to 1200x800.
- Refactored session info construction in the stop method to avoid errors if the session wasn't started, ensuring safer handling of session data.
- Changed the variable used for capturing screenshots from `result.screenshot` to `result.base64_image` for predefined functions, ensuring compatibility with the updated response structure.
update default viewport size
Update DEFAULT_SCREEN_SIZE from 1024x768 to 1200x800 to match the actual
browser viewport created in session.ts/session.py. Mismatched dimensions
caused incorrect coordinate denormalization for Gemini's 0-1000 scale.
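
A short worked example shows the size of the error. The helper `to_pixels` is a hypothetical stand-in for the templates' denormalization, and the sample point is arbitrary.

```python
NORMALIZED_MAX = 1000


def to_pixels(norm: int, size: int) -> int:
    """Scale a 0-1000 coordinate onto an axis of `size` pixels."""
    return norm * size // NORMALIZED_MAX


stale_size = (1024, 768)    # old DEFAULT_SCREEN_SIZE
actual_size = (1200, 800)   # viewport actually created by the session
norm_point = (750, 500)     # a point as Gemini would report it

# Where the click landed when the stale size was used for scaling.
clicked = tuple(to_pixels(n, s) for n, s in zip(norm_point, stale_size))
# Where the model meant to click in the real viewport.
intended = tuple(to_pixels(n, s) for n, s in zip(norm_point, actual_size))
offset = tuple(i - c for i, c in zip(intended, clicked))
```

For this point the click lands at (768, 384) instead of (900, 400), which is 132 pixels off horizontally and 16 vertically. The error grows toward the right and bottom edges.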
…_SIZE

Import DEFAULT_SCREEN_SIZE from types into session files so viewport
dimensions are defined in one place. Users can now update dimensions
by editing only the types file.

- Introduced `getSystemPrompt` function in both TypeScript and Python templates to generate the system prompt dynamically, including the current date.
- Updated the sampling loop to utilize the new function, improving code readability and maintainability.
- Deleted the test case asserting that the Gemini Computer Use template is unavailable for Python, since the template now supports Python.
- Updated the list of unavailable template-language combinations accordingly.
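
A dynamic system prompt of the kind described above might be built like this. The prompt wording is invented for illustration, and the Python name `get_system_prompt` and the optional `today` parameter are assumptions added to make the function testable.

```python
from datetime import date, datetime
from typing import Optional


def get_system_prompt(today: Optional[date] = None) -> str:
    """Build the system prompt, embedding the current date.

    `today` is a hypothetical override so tests can pin the date.
    """
    current = today if today is not None else datetime.now().date()
    return (
        "You are an agent that controls a web browser to complete tasks.\n"
        f"The current date is {current.isoformat()}."
    )


prompt = get_system_prompt(date(2026, 1, 23))
```

Generating the prompt at call time means a long-running deployment does not keep reporting the date on which the module was first imported.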
@dprevoznik
Contributor Author

🔧 CI Fix Available

I've pushed a fix for the CI failure. The tests were expecting gemini-computer-use to not be available for Python, but the template now supports both TypeScript and Python.

Handled in this commit 08ff1aa

- Corrected the Google AI API key link in both Python and TypeScript README files.
- Updated the Kernel documentation link to point to the Computer Controls section for better clarity.
- Updated the magnitude assignment in the ComputerTool class to use nullish coalescing (??) instead of logical OR (||), so that only null or undefined values fall back to the default.
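
The difference between `||` and `??` shows up when a falsy but valid value such as 0 is passed. Python's `or` has the same pitfall, so the Python analogue below illustrates it. The constant `DEFAULT_MAGNITUDE` and both function names are hypothetical.

```python
from typing import Optional

DEFAULT_MAGNITUDE = 3  # hypothetical default, for illustration only


def magnitude_with_or(magnitude: Optional[int]) -> int:
    # Mirrors `magnitude || DEFAULT` in TypeScript: any falsy value,
    # including an explicit 0, is replaced by the default.
    return magnitude or DEFAULT_MAGNITUDE


def magnitude_with_none_check(magnitude: Optional[int]) -> int:
    # Mirrors `magnitude ?? DEFAULT`: only a missing value falls back.
    return DEFAULT_MAGNITUDE if magnitude is None else magnitude


explicit_zero_or = magnitude_with_or(0)
explicit_zero_checked = magnitude_with_none_check(0)
```

With the `or` form an explicit magnitude of 0 silently becomes 3, while the `None` check preserves it.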
@dprevoznik
Contributor Author

Conducted multiple agent reviews + deslop commands + bugbot fixes.

@dprevoznik requested review from ehfeng and rgarcia January 23, 2026 15:58
- Added descriptions for `open_web_browser` and `search` actions in both Python and TypeScript README files to enhance clarity on browser functionalities.
…uterTool

- Eliminated the `screenshot` attribute from the `ToolResult` class and its usage in the `ComputerTool` class, streamlining the data structure and focusing on the `base64_image` representation.
@dprevoznik requested review from Sayan-, juecd and masnwilliams and removed request for ehfeng, juecd and rgarcia January 23, 2026 21:46
@dprevoznik
Contributor Author

Both templates for Gemini CUA worked very well in testing.

replays.9.mp4

Contributor

@Sayan- left a comment

overall lgtm. my comments aren't hard blocking but would prefer to address now.

Resolves conflicts by including both Gemini and Yutori computer use
templates for Python.
- Added an `error` field to the `QueryOutput` type in both Python and TypeScript templates to capture and return error messages.
- Updated the `sampling_loop` function in both languages to handle exceptions and store error messages, improving feedback during execution.
- Adjusted return values to include the new `error` field, ensuring consistent output structure across templates.
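
The error-capturing pattern described above can be sketched as follows. The `result` field and the wrapper name `run_with_error_capture` are assumptions. Only the `error` field is named in the commit message.

```python
from typing import Callable, Optional, TypedDict


class QueryOutput(TypedDict):
    result: str            # assumed field name for the final answer
    error: Optional[str]   # None on success, message text on failure


def run_with_error_capture(task: Callable[[], str]) -> QueryOutput:
    """Run `task` and always return the same output shape."""
    try:
        return {"result": task(), "error": None}
    except Exception as exc:
        # Store the message instead of raising so the caller gets feedback.
        return {"result": "", "error": str(exc)}


def failing_task() -> str:
    raise RuntimeError("model request failed")


ok_output = run_with_error_capture(lambda: "Featured article: Example")
failed_output = run_with_error_capture(failing_task)
```

Returning a consistent structure lets callers check `error` without wrapping every invocation in their own try/except.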
refactor(loop): simplify Gemini client initialization in sampling loop

- Removed environment variable dependencies from the Gemini client initialization in the `sampling_loop` function, streamlining the setup process.
- The client is now instantiated solely with the API key, enhancing clarity and reducing complexity in the code.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

…ns dynamically

- Updated the `PREDEFINED_COMPUTER_USE_FUNCTIONS` in both Python and TypeScript templates to derive directly from the `GeminiAction` enum, ensuring consistency and reducing maintenance overhead when adding new actions.
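
Deriving the list from the enum might look like the sketch below. The enum members are taken from the action names mentioned in this PR. The exact member names and the choice of a `str`-based enum are assumptions.

```python
from enum import Enum


class GeminiAction(str, Enum):
    # Member set assumed from the actions named in this PR.
    OPEN_WEB_BROWSER = "open_web_browser"
    SEARCH = "search"
    CLICK_AT = "click_at"
    HOVER_AT = "hover_at"
    TYPE_TEXT_AT = "type_text_at"
    SCROLL_DOCUMENT = "scroll_document"
    SCROLL_AT = "scroll_at"
    NAVIGATE = "navigate"
    GO_BACK = "go_back"
    GO_FORWARD = "go_forward"
    KEY_COMBINATION = "key_combination"
    DRAG_AND_DROP = "drag_and_drop"
    WAIT_5_SECONDS = "wait_5_seconds"


# Derived rather than hand-maintained: adding an enum member is enough.
PREDEFINED_COMPUTER_USE_FUNCTIONS = [action.value for action in GeminiAction]


def is_predefined(name: str) -> bool:
    """Return True when `name` is one of the predefined Gemini actions."""
    return name in PREDEFINED_COMPUTER_USE_FUNCTIONS
```

Because the list is computed from the enum, the two can no longer drift apart when a new action is added.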
@dprevoznik removed the request for review from masnwilliams January 25, 2026 15:48
…ion option for python

- Added error logging in the TypeScript template to print error messages when present in the result.
- Implemented a local execution block in the Python template to run tests directly, including error handling and logging for better debugging feedback.
@dprevoznik requested a review from Sayan- January 25, 2026 16:45
@dprevoznik merged commit 2e78ffc into main Jan 26, 2026
2 checks passed
@dprevoznik deleted the danny/kernel-870-gemini-cua-templates branch January 26, 2026 17:31